Georeferenced or georeferencable data are increasingly available on the web. However, identifying relevant information for geospatial applications and analysis requires sifting through different data types as well as various levels of organization and accuracy. Goodchild’s prediction that we are increasingly becoming sensors is becoming a reality, with growing numbers of volunteered photos, reviews, and tweets that include location information identifying where the content was captured, accessed, or recorded. Such data represent a large, geographically representative, and likely socially diverse data source.
For example, websites often have business addresses that can be mined for analysis of their location. Such platforms require more intensive, tailored algorithms to collect these data, unlike social media, which is often accompanied by an Application Programming Interface (API). APIs are specially designed protocols that allow developers to access web content from specific platforms (e.g., Twitter, Facebook). While such interfaces enable scraping of this content, they also impose data limits and, in some cases, restrictions to prevent large-scale data collection that would tax servers. Such limits are in place to protect the anonymity of users as well as to protect data that is increasingly being commodified.
Web scraping is the process of retrieving and parsing content from the web. In its simplest form, web scraping is a copy-and-paste operation: manually identifying a webpage’s content and organizing this information into relational databases. In fact, copy-and-paste techniques can be a solution when websites have barriers preventing machine-automated scraping. Much of the time, however, the underlying structure of websites makes it possible to automate such procedures for more efficient collection of the content.
Web scraping requires deciphering the structure of a website and developing algorithms that parse relevant content. Most websites display content organized according to the structure of a database, making it possible to develop algorithms that systematically query and extract these data as needed. For example, on a tourism advertising website, the phone number, address, and name will likely be stored according to a systematic hierarchy. Through detection of these templates, parsing programs or wrappers are able to translate this content into a relational form. Wrappers can be designed to interpret Hypertext Markup Language (HTML) code, web browser schemes (the Document Object Model, or DOM), the Uniform Resource Locator (URL) common scheme, and semantic markups (e.g., semantic annotation recognition). Semantic markups simplify the development of wrappers because tags are known beforehand or are easily interpreted. Extensible Markup Language (XML) and JavaScript Object Notation (JSON) are particularly prominent annotations. For example, JSON is used by Google Maps to encode address data, including tags that indicate place names, addresses, and coordinates. Algorithm development is rarely straightforward, as data on the web are often unstructured or loosely structured, containing superfluous information that must be filtered out in order to obtain georeferenced or georeferencable web content.
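To make the idea of a wrapper for semantic markup concrete, here is a minimal sketch in R. The JSON record below is hypothetical and deliberately simplified (real geocoding responses, such as those from Google Maps, are more deeply nested), but because the tags are known beforehand, translating the record into relational form takes only a few lines.

```r
# A minimal sketch of a wrapper for JSON-annotated address data.
# The JSON record is made up for illustration.
library(jsonlite)

raw <- '{
  "results": [
    {"name": "City Hall",
     "formatted_address": "301 E Huron St, Ann Arbor, MI",
     "geometry": {"location": {"lat": 42.2814, "lng": -83.7483}}}
  ]
}'

parsed <- fromJSON(raw)

# Because the tags (name, formatted_address, lat, lng) are known
# beforehand, extracting the relational fields is straightforward
places <- data.frame(
  name    = parsed$results$name,
  address = parsed$results$formatted_address,
  lat     = parsed$results$geometry$location$lat,
  lng     = parsed$results$geometry$location$lng
)
places
```

The same pattern generalizes: once the annotation scheme is known, the wrapper simply walks the tag hierarchy and binds the fields into a table.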
Web crawling is the automated and methodical browsing of the web using programs or automated scripts (web crawlers, web spiders, or web robots) and can be used in conjunction with web scraping to obtain content from multiple web pages. Within a website, individual webpages are typically encoded by common scripts or templates within page hierarchies, which allows for systematic information retrieval (provided these templates are utilized). Algorithms can detect systematic differences in URLs to retrieve data from multiple pages. In addition to gathering information from a particular database accessible via the web, web crawling may involve identifying links within websites and following these links, and links from the linked websites, until a desired network is reached.
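The URL-pattern idea can be sketched in a few lines of R: many sites expose the page number directly in the URL, so a crawler can generate the full set of page addresses programmatically. The base URL below is hypothetical.

```r
# Hypothetical paginated listing site: the page number appears
# in the URL query string
base_url <- "https://www.example.com/listings?page=%d"

# Generate URLs for the first 5 pages by substituting the page number
page_urls <- sprintf(base_url, 1:5)
page_urls
```

Each generated URL could then be fetched and parsed in a loop, with the same scraping function applied to every page since they share a template.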
High-level programming languages offer a large range of scripts or packages that can detect website templates for parsing web content based on text grepping and regular-expression matching. Grepping isolates keywords or patterns based on the UNIX grep command or regular-expression matching. Carefully applied, this enables the user to capture specific, well-defined sections of a page’s HTML tree. Such techniques are helpful in processing the large amounts of data generated through web crawling or scraping, filtering out irrelevant content and parsing the relevant key terms or number sequences.
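As a small illustration (the text snippets are made up), base R’s `grepl()` and `regmatches()` mirror the grep-style filtering described above: one pattern both selects the relevant snippets and extracts the number sequence of interest.

```r
# Scraped text snippets, some containing a US-style phone number
snippets <- c("Call us at 734-555-0199 for reservations",
              "Open daily 9am-5pm",
              "Fax: 734-555-0142")

# Regular expression for a ddd-ddd-dddd phone number
phone_re <- "[0-9]{3}-[0-9]{3}-[0-9]{4}"

# Keep only the snippets that contain a phone number...
has_phone <- grepl(phone_re, snippets)

# ...and extract the matched number itself
phones <- regmatches(snippets, regexpr(phone_re, snippets))
phones
```

Note that “9am-5pm” is correctly rejected: the pattern is specific enough to separate phone numbers from other digit sequences.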
In today’s lab we will look under the hood of a website to give you some rudimentary understanding of web content. We will also give some suggestions on how to scrape this web content. Finally, I will demonstrate how to access the Twitter API for scraping tweets.
If we right-click on a page and choose Inspect, we see a window that shows lines of code. What is this exactly? It is basically the backend of the page: it organizes the graphics on the page (e.g., advertisements, pictures) as well as the information (e.g., the article content). This information is stored in a nested hierarchy. We see <div class="content__head content.... and then <p>, which in HTML marks a paragraph. This information lets us know where we might find the article text on The Guardian website. There are several different web scrapers that can do this for you, but in today’s exercise we will only be working with APIs. APIs organize tags so that they can be quickly accessed by developers, and platforms structure their sites so that others can develop applications based on their content. For example, if we go to the developer page of Twitter we find resources to access tweets.
To create an app, press the Apps button on the developer website and, in the new window, press Create an app. Under App details, fill in a name, description, and website URL, then click Enable. Sign in with Twitter and tell how the data will be used (the other fields are not required). Fill out some details about how you will use the data, and confirm on the next page that everything looks good. We will need the Consumer API Keys and the Access Token and Token Secret, which can be found under the Keys and tokens tab. Create the tokens, then copy both the keys and tokens and paste them in a safe location to save for later. So now we are all set to map tweets… well, actually there is a rather large amount of coding involved if you go by the developer website. Luckily for us there is an R package that helps us easily query tweets. rtweet is an R package designed to collect and organize Twitter data via Twitter’s REST and stream Application Program Interfaces (APIs), documented at the following URL: https://developer.twitter.com/en/docs. It allows us to listen to the current stream of tweets, as well as query Twitter’s database of tweets.
For today we will install and load rtweet, tidytext, and stringr. We will also install ggspatial, sf, and rnaturalearth, which we might demonstrate at the end if we have time.
# possibly use an older version if you have issues with the
#Error: lexical error: invalid char in json text.
#devtools::install_version("rtweet", version = "0.6.7")
library(rtweet)
# plotting and pipes - tidyverse!
library(ggplot2)
## Warning: replacing previous import 'vctrs::data_frame' by 'tibble::data_frame'
## when loading 'dplyr'
library(dplyr)
# text mining library
library(tidytext)
library(stringr)
library(maps)
library(tidyverse)
library(ggspatial)
library(sf)
library(rnaturalearth)
rtweet does this using a few lines of code. First we give the details of the app that we made on the developer website. This includes the name of the app and the consumer_key, consumer_secret, access_token, and access_secret. These should remain secret, and I have left them blank for this exercise.
## Give the name of the App
appname <- "Geoscraping"
api_key <- "XXXXXXXXXXX"
api_secret_key <- "XXXXXXXXXXXXXXXXXXXXXXXXXXXX"
access_token <- "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
access_token_secret <- "XXXXXXXXXXXXXXXXXXXXXXXXXX"
## authenticate via web browser
token <- create_token(
app = appname,
consumer_key = api_key,
consumer_secret = api_secret_key,
access_token = access_token,
access_secret = access_token_secret)
The main argument is the query, q. The search is not exhaustive: not all tweets will be indexed or made available via the search interface. The specific query elements included in the rtweet function can be found here. The q argument is the character query that will be searched. For example, we can search q = "climate change" to look for tweets mentioning the words ‘climate’ and ‘change’ (spaces act as AND). We can also use Boolean operators in the search. For example, q = "climate OR change" expands the search to tweets that mention climate or change. Using single quotes, q = '"climate change"', returns the specific phrase ‘climate change’. Other important arguments in the function include: n = 10, which specifies the number of tweets you want returned; type =, whether you want recent (the default), popular, or mixed tweets returned; include_rts, if you want to include retweets (e.g., include_rts = TRUE); and geocode, which restricts the search to a specific geographic area. This is important for us, as we want to grab tweets with their coordinates; we will use geocode below to map tweets.
Let’s try to search for tweets about SEAS. We will use the hashtag, and indicate that we only want 5 tweets.
SEAS_tweets <- search_tweets(q = "#SEAS", n = 5)
## Searching for tweets...
## Finished collecting tweets!
## Here we can see some of the information that is returned from the query
names(SEAS_tweets)
## [1] "user_id" "status_id"
## [3] "created_at" "screen_name"
## [5] "text" "source"
## [7] "display_text_width" "reply_to_status_id"
## [9] "reply_to_user_id" "reply_to_screen_name"
## [11] "is_quote" "is_retweet"
## [13] "favorite_count" "retweet_count"
## [15] "hashtags" "symbols"
## [17] "urls_url" "urls_t.co"
## [19] "urls_expanded_url" "media_url"
## [21] "media_t.co" "media_expanded_url"
## [23] "media_type" "ext_media_url"
## [25] "ext_media_t.co" "ext_media_expanded_url"
## [27] "ext_media_type" "mentions_user_id"
## [29] "mentions_screen_name" "lang"
## [31] "quoted_status_id" "quoted_text"
## [33] "quoted_created_at" "quoted_source"
## [35] "quoted_favorite_count" "quoted_retweet_count"
## [37] "quoted_user_id" "quoted_screen_name"
## [39] "quoted_name" "quoted_followers_count"
## [41] "quoted_friends_count" "quoted_statuses_count"
## [43] "quoted_location" "quoted_description"
## [45] "quoted_verified" "retweet_status_id"
## [47] "retweet_text" "retweet_created_at"
## [49] "retweet_source" "retweet_favorite_count"
## [51] "retweet_retweet_count" "retweet_user_id"
## [53] "retweet_screen_name" "retweet_name"
## [55] "retweet_followers_count" "retweet_friends_count"
## [57] "retweet_statuses_count" "retweet_location"
## [59] "retweet_description" "retweet_verified"
## [61] "place_url" "place_name"
## [63] "place_full_name" "place_type"
## [65] "country" "country_code"
## [67] "geo_coords" "coords_coords"
## [69] "bbox_coords" "status_url"
## [71] "name" "location"
## [73] "description" "url"
## [75] "protected" "followers_count"
## [77] "friends_count" "listed_count"
## [79] "statuses_count" "favourites_count"
## [81] "account_created_at" "verified"
## [83] "profile_url" "profile_expanded_url"
## [85] "account_lang" "profile_banner_url"
## [87] "profile_background_url" "profile_image_url"
Useful fields include text, user_id, created_at, screen_name, favorite_count, retweet_count, and geo_coords. Let’s look at these data (feel free to explore).
SEAS_tweets$text
## [1] "Strong #winds, rough #seas alert \n\n#WeatherForTheWeekAhead #WeatherUpdate #WeatherForecast #WeatherReport #RainyDay #lka #SriLanka @MeteoLK \n\nhttps://t.co/gwwSL0P3Yl"
## [2] "Beware of calm seas..\n#quotes #seas #shallow #deeper https://t.co/8PxRRXfUMT"
## [3] "#SEAS #Bears_of_Wall_Street SeaWorld: Bankruptcy Risk Is Real https://t.co/9QXJaahPfY"
## [4] "The best hikes are not always the longest but the one's with the best views! #nature #ocean #seas #lookout #beaches #mountains #naturephotographry #outdoors #hiker #backpacking #daytrip #outdoorfeverdirect https://t.co/kGY3nffV3T"
## [5] "First threshold value for #GoodEnvironmentalStatus to be adopted under the EU #MarineDirective. A great example of how sound science, political willingness and public pressure can come together to improve the state of European #seas and #ocean. \n\n#MSFD #EUBeachCleanup #OurOcean https://t.co/ZaswFiVXPK"
SEAS_tweets$user_id
## [1] "3244922072" "104223961" "770087685738340353"
## [4] "1280909896754454528" "1306130924975906816"
SEAS_tweets$created_at
## [1] "2020-09-21 17:51:05 UTC" "2020-09-21 10:04:30 UTC"
## [3] "2020-09-21 09:26:03 UTC" "2020-09-21 07:54:01 UTC"
## [5] "2020-09-21 07:53:11 UTC"
SEAS_tweets$screen_name
## [1] "sumanebot" "EmperatrizPages" "designyourinves" "outdoorfd"
## [5] "alice_belin"
SEAS_tweets$favorite_count
## [1] 0 1 0 0 3
SEAS_tweets$retweet_count
## [1] 2 0 0 0 0
SEAS_tweets$geo_coords
## [[1]]
## [1] NA NA
##
## [[2]]
## [1] NA NA
##
## [[3]]
## [1] NA NA
##
## [[4]]
## [1] NA NA
##
## [[5]]
## [1] NA NA
To restrict the search geographically we use the geocode = argument in the function. We have to specify the location of the tweets using coordinates, following the template “latitude,longitude,radius”. Here we use the following coordinates.
geo_tweets <- search_tweets("lang:en", geocode = "42.28139,-83.74833,100mi", n = 5)
## Searching for tweets...
## Finished collecting tweets!
geo_tweets$geo_coords
## [[1]]
## [1] NA NA
##
## [[2]]
## [1] NA NA
##
## [[3]]
## [1] NA NA
##
## [[4]]
## [1] NA NA
##
## [[5]]
## [1] NA NA
geo_tweets <- search_tweets("lang:en", geocode = "42.28,-83.74,100mi", n = 1000)
## Searching for tweets...
## Finished collecting tweets!
## create lat/lng variables using all available tweet and profile geo-location data
geo_tweets_coord <- lat_lng(geo_tweets)
nrow(geo_tweets_coord)
## [1] 999
## plot state boundaries
par(mar = c(0, 0, 0, 0))
maps::map("state", fill = TRUE, col = "#ffffff",
lwd = .25, mar = c(0, 0, 0, 0),
xlim = c(-90, -82), ylim = c(41, 48))
## plot lat and lng points onto state map
with(geo_tweets_coord, points(lng, lat, pch = 20, cex = .75, col = rgb(0, .3, .7, .75)))
We can also access the live Twitter stream. This includes a random sample (approximately 1%) of the entire live stream of all tweets.
## random sample for 30 seconds (default)
live <- stream_tweets(q="")
## Streaming tweets for 30 seconds...
## Finished streaming tweets!
## opening file input connection.
## closing file input connection.
nrow(live)
nrow(stream_tweets("", timeout = 10))
## Streaming tweets for 10 seconds...
## Finished streaming tweets!
## opening file input connection.
## closing file input connection.
New_Delhi <- stream_tweets(q="", geocode = "28.61,77.21,100km", timeout = 60)
## Streaming tweets for 60 seconds...
## Finished streaming tweets!
## opening file input connection.
## closing file input connection.
## Grab the coordinates
New_Delhi <- lat_lng(New_Delhi)
par(mar = c(0, 0, 0, 0))
maps::map("world", lwd = .25)
## plot lat and lng points onto state map
with(New_Delhi, points(lng, lat, pch = 20, cex = .75, col = rgb(0, .3, .7, .75)))
To keep only geo-tagged tweets, we filter with !is.na on the place_type tag, as the place object is always present when a tweet is geo-tagged, while the coordinates object is only present (non-null) when the tweet is assigned an exact location.
## here we are filtering out NAs
Delhi_coord <- filter(New_Delhi, !is.na(place_type))
## Now we change the data into a sf object. We define the coordinates
## xlim, ylim (longitude, latitude), and provide an appropriate projection
Delhi_coord <- st_as_sf(Delhi_coord, coords = c("lng", "lat"), crs = 4326)
## Using ggplot and library(rnaturalearth), which provides vector data
## for countries and the globe, we can map the points
world <- ne_countries(scale = "medium", returnclass = "sf")
ggplot(data = world) +
geom_sf() +
geom_sf(data = Delhi_coord, size = 4, shape = 23, fill = "darkred") ### here we specify the size
### shape and color of the points to be mapped
In lab this week, we will be learning how to scrape Flickr photographs and how to use keywords in the captions to identify specific photographs.